Add embedding-based detector #2
@@ -0,0 +1,37 @@
# Embedding Classification Detector

# Setup
Perhaps it would be useful to state that the local Python version must match the Python version in the Containerfile? At present, Python 3.9 is downloaded inside the container, which may warrant upgrading.
Can this be moved to a shared utils folder so that other detectors that require training can use this class?
sys.path.insert(0, os.path.abspath(""))
# from common.scheme import TextDetectionHttpRequest, TextDetectionResponse
import os
delete duplicate import
Adds a framework for defining detections based on a text-embedding classifier. The default configuration uses MMLU as the training data and creates a multi-label text classifier that infers which of the 61 MMLU subjects a particular body of text belongs to. The detector endpoint accepts the following arguments:
- `contents`: List of texts to classify.
- `allowList`: Allowed list of subjects: all inbound texts must belong to at least one of these subjects to avoid flagging the detector.
- `blockList`: Blocked list of subjects: all inbound texts must not belong to any of these subjects to avoid flagging the detector.
- `threshold`: Defines the maximum distance a body of text can be from the subject centroid and still be classified into that subject. The default value is 0.75, while a `threshold` above 10 will classify every document into every subject. As such, values in the range 0 < `threshold` < 1 are recommended.
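To illustrate how these arguments interact, here is a minimal sketch of the thresholding and allow/block logic described above. It is not the code from this PR: the function names, the toy two-dimensional centroids, the use of Euclidean distance, and the treatment of an empty list as "no restriction" are all assumptions made for illustration.

```python
import math


def subjects_within_threshold(embedding, centroids, threshold=0.75):
    """Return every subject whose centroid is at most `threshold` away.

    Assumption: Euclidean distance; the detector's real metric may differ.
    """
    return {
        subject
        for subject, centroid in centroids.items()
        if math.dist(embedding, centroid) <= threshold
    }


def is_flagged(subjects, allow_list=None, block_list=None):
    """Apply the allowList/blockList rules to one text's matched subjects.

    Assumption: an empty or missing list imposes no restriction.
    """
    # allowList: the text must match at least one allowed subject.
    if allow_list and not (subjects & set(allow_list)):
        return True
    # blockList: the text must not match any blocked subject.
    if block_list and (subjects & set(block_list)):
        return True
    return False


# Hypothetical centroids; real ones would come from the trained classifier.
centroids = {"astronomy": [0.0, 0.0], "virology": [1.0, 0.0]}
text_embedding = [0.1, 0.0]

matched = subjects_within_threshold(text_embedding, centroids)
flag_with_allow = is_flagged(matched, allow_list=["astronomy"])
flag_with_block = is_flagged(matched, block_list=["astronomy"])

# A very large threshold places the text in every subject.
matched_loose = subjects_within_threshold(text_embedding, centroids, threshold=10)
```

With the default threshold the toy text matches only `astronomy`, so it passes an allow list containing that subject and is flagged by a block list containing it. Raising the threshold far enough makes every centroid count as a match, which is why small values are the useful range.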